The Annals of Statistics 

2007, Vol. 35, No. 1, 420-448 

DOI: 10.1214/009053606000001154 

© Institute of Mathematical Statistics, 2007 



CONVERGENCE OF ADAPTIVE MIXTURES OF IMPORTANCE 
SAMPLING SCHEMES 1 

By R. Douc, A. Guillin, J.-M. Marin and C. P. Robert 

Ecole Poly technique, Ecole Centrale Marseille and LATP, CNRS, INRIA 
Futurs, Projet Select, Universite d'Orsay and Universite Paris Dauphine 

and CREST-INSEE 

In the design of efficient simulation algorithms, one is often be- 
set with a poor choice of proposal distributions. Although the per- 
formance of a given simulation kernel can clarify a posteriori how 
adequate this kernel is for the problem at hand, a permanent on- 
line modification of kernels causes concerns about the validity of 
the resulting algorithm. While the issue is most often intractable for 
MCMC algorithms, the equivalent version for importance sampling 
algorithms can be validated quite precisely. We derive sufficient con- 
vergence conditions for adaptive mixtures of population Monte Carlo 
algorithms and show that Rao-Blackwellized versions asymptotically 
achieve an optimum in terms of a Kullback divergence criterion, while 
more rudimentary versions do not benefit from repeated updating. 

1. Introduction. 

1.1. Monte Carlo calibration. In the simulation settings found in opti- 
mization and (Bayesian) integration, it is well documented [20] that the 
choice of the instrumental distributions is paramount for the efficiency of 
the resulting algorithms. Indeed, in an importance sampling algorithm with 
importance function g(x), we are relying on a distribution g that is cus- 
tomarily difficult to calibrate outside a limited range of well-known cases. 
For instance, a standard result is that the optimal importance density for 
approximating an integral 

f(x)7r(x) dx 
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is g*{x) oc \ f(x)\Tr(x) ([20], Theorem 3.12), but this formal result is not 
very informative about the practical choice of g, while a poor choice of g 
may result in an infinite variance estimator. MCMC algorithms somehow 
attenuate this difficulty by using local proposals like random walk kernels, 
but two drawbacks of these proposals are that they may take a long time 
to converge [18] and their efficiency ultimately depends on the scale of the 
local exploration. 

The goals of Monte Carlo experiments are multifaceted and therefore the 
efficiency of an algorithm can be evaluated from many different perspectives. 
In particular, in Bayesian statistics the Monte Carlo sample can be used to 
approximate a variety of posterior quantities. Nonetheless, if we try to as- 
sess the generic efficiency of an algorithm and thus develop a portmanteau 
device, a natural approach is to use a measure of agreement between the 
target and the proposal distribution, similar to the intrinsic loss function 
proposed in [19] for an invariant estimation of parameters. A robust mea- 
sure of similarity ubiquitous in statistical approximation theory [7] is the 
Kullback divergence 



and this paper aims to minimize (£(-7r,7r) within a class of proposals 7f. 

1.2. Adaptivity in Monte Carlo settings. Given the complexity of the 
original optimization or integration problem (which does itself require Monte 
Carlo approximations), it is rarely the case that the optimization of the pro- 
posal distribution against an efficiency measure can be achieved in closed 
form. Even the computation of the efficiency measure for a given proposal 
is impossible in the majority of cases. For this reason, a number of adaptive 
schemes have appeared in the recent literature ([20], Section 7.6.3) in order 
to design better proposals against a given measure of efficiency without re- 
sorting to a standard optimization algorithm. For instance, in the MCMC 
community, sequential changes in the variance of Markov kernels have been 
proposed in [13, 14], while adaptive changes taking advantage of regeneration 
properties of the kernels have been constructed by Gilks, Roberts and Sahu 
[12] and Sahu and Zhigljavsky [23, 24]. From a more general perspective, 
Andrieu and Robert [2] have developed a two-level stochastic optimization 
scheme to update parameters of a proposal towards a given integrated effi- 
ciency criterion such as the acceptance rate (or its difference with a value 
known to be optimal — see Roberts, German and Gilks [21]). However, as 
reflected in this general technical report of Andrieu and Robert [2], the com- 
plexity of devising valid adaptive MCMC schemes is a genuine drawback in 
their extension, given that the constraints on the inhomogeneous Markov 
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chain that results from this adaptive construction are either difficult to sat- 
isfy or result in a fixed proposal after a certain number of iterations. 

Cappe, Guillin, Marin and Robert [3] (see also [20], Chapter 14) devel- 
oped a methodology called Population Monte Carlo (PMC) [16] motivated 
by the observation that the importance sampling perspective is much more 
amenable to adaptivity than MCMC, due to its unbiased nature: using sam- 
pling importance resampling, any given sample from an importance distri- 
bution g can be transformed into a sample of points marginally distributed 
from the target distribution ir and Cappe et al. [3] (see also [8]) showed that 
this property is also preserved by repeated and adaptive sampling. (In this 
setting, "adaptive" is to be understood as a modification of the importance 
distribution based on the results of previous iterations.) The asymptotics 
of adaptive importance sampling are therefore much more manageable than 
those of adaptive MCMC algorithms, at least at a primary level, if only 
because the algorithm can be stopped at any time. Indeed, since every iter- 
ation is a valid importance sampling scheme, the algorithm does not require 
a burn-in time. Borrowing from the sequential sampling literature [11], the 
methodology of Cappe et al. [3] thus aimed at producing an adaptive impor- 
tance sampling scheme via a learning mechanism on a population of points, 
themselves marginally distributed from the target distribution. However, as 
shown by the current paper, the original implementation of Cappe et al. [3] 
may suffer from an asymptotic lack of adaptivity that can be overcome by 
Rao-Blackwellization. 

1.3. Plan and objectives. This paper focuses on a specific family of im- 
portance functions that are represented as mixtures of an arbitrary number 
of fixed proposals. These proposals can be educated guesses of the target 
distribution, random walk proposals for local exploration of the target, non- 
parametric kernel approximations to the target or any combination of these. 
Using these fixed proposals as a basis, we then devise an updating mecha- 
nism for the weights of the mixture and prove that this mechanism converges 
to the optimal mixture, the optimality being defined here in terms of Kull- 
back divergence. From a probabilistic point of view, the techniques used in 
this paper are related to techniques and results found in [4, 6, 17]. In partic- 
ular, the triangular array technique that is central to the CLT proofs below 
can be found in [4, 10]. 

The paper is organized as follows. We first present the algorithmic and 
mathematical details in Section 2 and establish a generic central limit the- 
orem. We evaluate the convergence properties of the basic version of PMC 
in Section 3, exhibiting its limitations, and show in Section 5 that its Rao- 
Blackwellized version overcomes these limitations and achieves optimality 
for the Kullback criterion developed in Section 4. Section 6 illustrates the 
practical convergence of the method on a few benchmark examples. 
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2. Population Monte Carlo. The Population Monte Carlo (PMC) algo- 
rithm introduced in [3] is a form of iterated sampling importance resampling 
(SIR). The appeal of using a repeated form of SIR is that previous samples 
are informative about the connections between the proposal (importance) 
and the target distributions. We stress from the outset that this scheme has 
very few connections with MCMC algorithms since (a) PMC is not Marko- 
vian, being based on the whole sequence of simulations and (b) PMC can 
be stopped at any time, being validated by the basic importance sampling 
identity ([20], equation (3.9)) rather than by a probabilistic convergence re- 
sult like the ergodic theorem. These features motivate the use of the method 
in setups where off-the-shelf MCMC algorithms cannot be of use. We first 
recall basic Monte Carlo principles, mostly to define notation and to make 
precise our goals. 

2.1. The Monte Carlo framework. On a measurable space (Q,A), we are 
given a target, that is, a probability distribution ir on (£l,A). We assume that 
7r is dominated by a reference measure /i, ir <C /x, and also denote by 7r(dx) = 
ir(x)(i(dx) its density. In most settings, including Bayesian statistics, the 
density tt is known up to a normalizing constant, tt{x) oc tt{x). The purpose 
of running a simulation experiment with the target it is to approximate 
quantities related to it, such as intractable integrals 



but we do not focus here on a specific quantity vr(/). In this setting, a stan- 
dard stochastic approximation method is the Monte Carlo method, based 
on an i.i.d. sample x±, . . . ,xn simulated from n, that approximates 7r(/) by 



which almost surely converges to vr(/) (as N goes to infinity) by the law of 
large numbers (LLN). The central limit theorem (CLT) implies, in addition, 
that if 7r(/ 2 ) = / f 2 (x)ir(dx) < oo, then 



where Y n (f) = 7r([/ — ir(f)] 2 ). Obviously, this approach requires a direct 
i.i.d. simulation from tt (or tt), which often is impossible. An alternative ([20], 
Chapter 3) is to use importance sampling, that is, to choose a probability 
distribution v <C \x on (Q,A) called the proposal or importance distribution, 
the density of which is also denoted by and to estimate ir(f) by 




N 



^ C (f)=N- 1 J2f(x i ) 



V^V{^ C (/)-vr(/)}-AA(0,V^(/)) 
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If 7r is also dominated by v, 7r <C v, then 7r* almost surely converges to 
7r(/) and if u{f 2 {'K /v) 2 ) < 00, then the CLT also applies, that is, 

V^SvCf) - tt(/)} 3 (0, V, ) . 

As the normalizing constant of the target distribution tt is unknown, it is not 
possible to directly use the IS estimator 7t^ N (f). A convenient substitute is 
the self-normalized IS estimator, 

/ N \ - 1 TV 

^ IS (/) = EfrM**) E/(^)(^)(^), 

\i=l / i=l 

which also converges almost surely to 7r(/). If + f 2 ){rr /v) 2 ) < 00, then 
the CLT applies: 

V^{<T(/) " *(/)} « A^(0, ¥,{[/ - tt(/)] ^}) . 

Obviously, the quality of the IS and SNIS approximations strongly depends 
on the choice of the proposal distribution v, which is delicate for complex 
distributions like those that occur in high-dimensional problems. (While 
multiplication of the number of proposals may offer some reprieve in this 
regard, we must stress at this point that our PMC methodology also suffers 
from the curse of dimensionality to which all importance sampling methods 
are subject in the sense that high-dimensional problems require a consider- 
able increase in computational power.) 

2.2. Sampling importance resampling. The sampling importance resam- 
pling (SIR) method of Rubin [22] is an extension of the IS method that 
achieves simulation from tt by adding resampling to simple reweighting. 
More precisely, the SIR algorithm operates in two stages. The first stage is 
identical to IS and consists in generating an i.i.d. sample {x\, . . . , xn) from v. 
The second stage builds a sample from tt, (x±, . . . , xm), based on the instru- 
mental sample (xi, . . . , xn), by resampling. While there are many resampling 
methods ([20], Section 14.3.5), the most standard (if least efficient) approach 
is multinomial sampling from {x\, . . . ,xn} with probabilities proportional 
to the importance weights [^(xi), . . . , ^(xjv)], that is, the replacement of 
the weighted sample {x\ , . . . , xn) by an unweighted sample (x±, . . . ,xm), 
where Xi = xj i (1 < i < M) and where (J±, . . . , Jm) ~ M.(M, g±, . . . , qn), the 
multinomial distribution with probabilities 

F[J e = i\x 1 ,...,x N ] = Qi(x ^(xi), 1 < i < N, 1 <£< M. 
The SIR estimator of vr(/) is then the standard average 

M 
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which also converges to 7r(/) since each Xj is marginally distributed from 
7r. By construction, the variance of the SIR estimator is greater than the 
variance of the SNIS estimator. Indeed, the expectation of nf}]^ M (f) condi- 
tional on the sample (x\, . . . ,xjv) is equal to (/)■ Note, however, that 
an asymptotic analysis of uif) ls quite delicate because of the depen- 
dencies in the SIR sample (which, again, is not an i.i.d. sample from tt). 

2.3. The population Monte Carlo algorithm. In their generalization of 
importance sampling, Cappe et al. [3] introduce an iterative dimension in 
the production of importance samples, aimed at adapting the importance 
distribution v to the target distribution tt. Iterations are then used to learn 
about 7r from the (poor or good) performance of earlier proposals and that 
performance can be evaluated using different criteria, as, for example, the 
entropy of the distribution of importance weights. 

More precisely, at iteration t (t = 0, 1, . . . ,T) of the PMC algorithm, a 
sample of N values from the target distribution tt is produced by a SIR 
algorithm whose importance function ut depends on t, in the sense that vt 
can be derived from the N x (t — 1) previous realizations of the algorithm, 
except for the first iteration t = 0, where the proposal distribution vq is 
chosen as in a regular importance sampling experiment. A generic rendering 
of the algorithm is thus as follows: 

PMC algorithm. At time t = 0, 1, . . . , T, 

1. generate (xij)i<i<N by i.i.d. sampling from ut an d compute the normal- 
ized importance weights u>i t t; 

2. resample (xi t t)i<i<N from (xi t t)i<i<N by multinomial sampling with weights 

3. construct based on (xj >r , Wi iT )i<i<7V,o<T<i- 

At this stage of the description of the PMC algorithm, we do not give 
in detail the construction of ut, which can thus arbitrarily depend on the 
past simulations. Section 3 studies a particular choice of the zVs in detail. 
A major finding of Cappe et al. [3] is, however, that the dependence of u t 
on earlier proposals and realizations does not jeopardize the fundamental 
importance sampling identity. Local adaptive importance sampling schemes 
can thus be chosen in much wider generality than was previously thought 
and this shows that, thanks to the introduction of a temporal dimension 
in the selection of the importance function, an adaptive perspective can be 
adopted at little cost, with a potentially large gain in efficiency. 

Obviously, when the construction of the proposal distribution vt is com- 
pletely open, there is no guarantee of permanent improvement of the simu- 
lation scheme, whatever the criterion adopted to measure this improvement. 



ADAPTIVE MIXTURES FOR IMPORTANCE SAMPLING 



7 



For instance, as illustrated by Cappe et al. [3], a constant dependence of u t 
on the past sample quickly leads to stable behavior. A more extreme illus- 
tration is the case where the sequence (u t ) degenerates into a quasi-Dirac 
mass at a value based on earlier simulations and where the performance of 
the resulting PMC algorithm worsens. In order to study the positive and 
negative effects of the update of the importance function vt, we hereafter 
consider a special class of proposals based on mixtures and study two par- 
ticular updating schemes, one in which improvement does not occur and one 
in which it does. 

3. The _D-kernel PMC algorithm. In this section and the following ones, 
we study adaptivity for a particular type of parameterized PMC scheme, in 
the case where Vt is a mixture of measures Cd (1 < d < D) that are chosen 
prior to the simulation experiment, based either on an educated guess about 
7r or on local approximations (as in nonparametric kernel estimation). The 
dependence of the fj's on the past simulations is of the form 

D 

v t (dx) = a d Cd({xi,t-i,^i,t-i}i<i<N,dx) 
d=l 
D N 

= ^2 a d ^^j,t-iQd(xj,t-i,dx), 

d=l j=l 

where the Qi t's denote the importance weights and the transition kernels Qd 
(1 <d< D) are given. This situation is rather common in MCMC settings 
where several competing transition kernels are simultaneously available, but 
difficult to compare. For instance, the cycle and mixture MCMC schemes 
discussed by Tierney [25] are of this nature. 

Hereafter, we thus focus on adapting the weights {a d )i<d<D toward a 
better fit with the target distribution. A natural approach to updating the 
weights (a d )i<d<D is to favor those kernels Qd that lead to a high acceptance 
probability in the resampling step of the PMC algorithm. We thus choose 
the a l d s to be proportional to the survival rates of the corresponding Q^'s in 
the previous resampling step [since using the double mixture of the measures 
Qd(xj t t-i,dx) means that we first select a point Xj t—i in the previous sample 
with probability (Da—i, then select a component d with probability ot d and, 
finally, simulate from Q d (xjj-i,dx)]. 

3.1. Algorithmic details. The family (Qd)i<d<D of transition kernels on 
x A is such that (Qd(x, -))i<d<D, xen is dominated by the reference mea- 
sure \x introduced earlier. As above, we set the corresponding target density 
and the transition kernel to be tt and qd(-, •) (1 < d < D), respectively. 

The associated PMC algorithm then updates the proposal weights (and 
generates the corresponding samples from tt) as follows: 
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D-KERNEL PMC ALGORITHM. At time 0, 

(a) generate x^o Vq (1 < i< N) and compute the normalized importance 
weights LV ifi oc {ir/v }(x ifi ); 

(b) resample (xi,o)i<i<7V into (ij,o)i<j<7V by multinomial sampling 

(Ji,o)i<i<N ~M(N, (^i,o)i<i<Jv), that is, x ifi = xj i0)0 ; 

(c) set a 1 / = l/D (l<d<D). 
At time t = 1, . . . ,T, 

(a) select the mixture components {Ki tt )i<i<N ^ A4(N, {a^ A T )i<d<D)', 

(b) generate independent x iit ~ t {x i t _i,x) (1 < i < N) and compute the 
normalized importance weights Q^t oc 7r(xij)/qKi t (^i,t-l ; ^i,t); 

(c) resample {xi^)\<i<N into {xi t t)\<i<N by multinomial sampling with weights 

{&i,t)l<i<N)', 

(d) set 4 +1,JV = Eili^,A(^,t) (1 < d< -D)- 

In this implementation, at time t > 1, step (a) chooses a kernel index d in 
the mixture for each point of the sample, while step (d) updates the weight 
ad as the relative importance of kernel Qd in the current round or, in other 
words, as the relative survival rate of the points simulated from kernel Qd- 
(Indeed, since the survival of a simulated value Xij is driven by its impor- 
tance weight Qi t, reweighting is related to the respective magnitudes of the 
importance weights for the different kernels.) Also, note that the resampling 
step (c) is used to avoid the propagation of very small importance weights 
along iterations and the subsequent degeneracy phenomenon that plagues 
iterated IS schemes like particle filters [11]. At time t = T, the algorithm 
should thus stop at step (b) for integral approximations to be based on the 
weighted sample {xi^T-,oji,T)i<i<N ■ 

3.2. Convergence properties. In order to assess the impact of this update 
mechanism on the performance of the PMC scheme, we now consider the 
convergence properties of the above algorithm when the sample size N grows 
to infinity. Indeed, as already pointed out in Cappe et al. [3], it does not make 
sense to consider the alternative asymptotics of the PMC scheme (namely, 
when T grows to infinity), given that this algorithm is intended to be run 
with a small number T of iterations. 

A basic assumption on the kernels Qd is that the corresponding impor- 
tance weights are almost surely finite, that is, 

(Al) We{l,...,D} 7 r® 7 r{q d ( x ,x')=0}=0, 

where £ (g) C denotes the product measure, that is, £,<g>((A x B) = J^ x b^(^ x ) x 
((dy). We denote by 7„ the uniform distribution on {!,...,£)}, that is, 
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7u(k) = 1/D for all k G {1, . . . ,D}. The following result (whose proof is 
given in Appendix B) is a general LLN on the pairs (xij,Kij) produced by 
the above algorithm. 

Proposition 3.1. Under (Al), for any it (g> j u -measurable function h 
and every t G N, 

N 

i=i 

Note that this convergence result is more than we need for Monte Carlo 
purposes since the JQ ; t's are auxiliary variables that are not relevant to the 
original problem. However, the consequences of this general result are far 
from negligible. First, it implies that the approximation provided by the 
algorithm is convergent, in the sense that J2i=i&i,tf( x i,t) is a convergent 
estimator of 7r(/). But, more importantly, it also shows that for t > 1, 

N 

a d =^^i,t^d{K it t) 1/D. 

i=l 

Therefore, at each iteration, the weights of all kernels converge to 1/D when 
the sample size grows to infinity. This translates into a lack of learning 
properties for the D-kernel PMC algorithm: its properties at iteration 1 and 
at iteration 10 are the same. In other words, this algorithm is not adaptive 
and only requires one iteration for a large value of N. We can also relate 
this to the fast stabilization of the approximation in [3]. (Note that a CLT 
can be established for this algorithm, but given its unappealing features, we 
leave the exercise for the interested reader.) 

4. The Kullback divergence. In order to obtain a correct and adaptive 
version of the D-kernel algorithm, we must first choose an effective criterion 
to evaluate both the adaptivity and the approximation of the target distri- 
bution by the proposal distribution. We then propose a modification of the 
original D-kernel algorithm that achieves efficiency in this sense. 

As argued in many papers, using a wide range of arguments, a natural 
choice of approximation metric is the Kullback divergence. We thus aim to 
derive the D-kernel mixture that minimizes the Kullback divergence between 
this mixture and the target measure 7r, 

(1) //'° 8 ( M x )T^t2( x J ^X"^')- 

J J \TT{X) l^d=l a dQd\X, X ) / 

This section is devoted to the problem of finding an iterative choice of mixing 
coefficients ad that converge to this minimum. A detailed description of the 
optimal PMC scheme is given in Section 5. 
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4.1. The criterion. Using the same notation as above, in conjunction 
with the choice of weights a d in the D-kernel mixture, we introduce the 
simplex of 

y = ja = (a%, . . .,a D ); Vd G {1, . . . ,D},a d > and ^oy = 1 j, 

and denote n = tt <g> tt. We now assume that the D kernels also satisfy the 
condition 

Vde{l,...,D\ 

(A2) 

E^[|log^(X,X / )|] = Jl \logq d (x,x')\7t(dx,dx') 



< oo, 



which is automatically satisfied when all ratios tt jq d are bounded. From the 
Kullback divergence, we derive a function on 5? such that for a G 5? , 



£n(a) = JJ log^2a d q d (x,x')ir(dx,dx') =E 

d=l 



D 



logJ2a d qd(X,X') 



d=l 



By virtue of Jensen's inequality, £n(a) < J 7r(dx)logir(x) for all a G 5? . 

Note that due to the strict concavity of the log function, is a strictly 
concave function on a connected compact set and thus has no local maximum 
besides the global maximum, denoted a max . Since 

J log 7T(x)7r(dx) - £^{a) = E* ^ log 7t(X)tt(X') /tt(X) j £ a^(X, X) j^, 

Q max - g p^ ma i vector of weights for a mixture of transition kernels 
such that the product of tt by this mixture is the closest to the product 
distribution tt. 



4.2. A maximization algorithm. We now study an iterative procedure, 
akin to the EM algorithm, that updates the weights so that the function 
£n{Gt) increases at each step. Defining the function F on 5? as 



F(a) = E* 



D 



a d q d (X,X') Y, a M X i X> ) 

' 3=1 



Kd<D 



we construct the sequence (a*)t>i on such that a 1 = (1/D, ... ,1/ D) and 
(yt+i = F(of) for t > 1. Note that under assumption (Al), for all t > 0, 



E^g^XO/^c^CX,^ >0 
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and thus for all t > and 1 < d < D, we have a d > 0. If we define the 
extremal set 

Ijj = ja G y-,\/d G {1, . . . , D}, either a d = or 



we then have the following fixed-point result: 

Proposition 4.1. Under (Al) and (A2), 

(i) £n ° F — £jt is continuous; 

(ii) /or ae^^o F(a) > £ # (a); 

(iii) Id = {a G J^;F(a) = a} = {a G o5^;<5^ o F(a) = S^(a)} and Id is 
finite. 

Proof. £^ is clearly continuous. Moreover, by Lebesgue's dominated 
convergence theorem, the function a ^ E^a^g^A, A')/EjLi ctjQj{X, A')) 
is also continuous, which implies that F is continuous. This completes the 
proof of (i). Due to the concavity of the log function, 



£*(F{a))-£*(a) 



= E*| log 

>E # 



D 



^ E?=i «^(A, A') " VEjii a jQj(X, X' 



Qd(X,X') 



V> a d q d (X,X\ 



%(A,A') 



E? =1 ^g,ix,*') 



%(A,A') 



E? = i«^(a,a'; 



logEs 



q d (X,X>) 



Ylf=i a jQj(X, X') 



Applying the inequality ulogn > u — 1 yields (ii). Moreover, the equality 
in ulogu > u — 1 holds if and only if u = 1. Therefore, equality above is 
equivalent to 

Va d / E* (^(A, A') /Y t a j q j {X, X'^j = 1. 

Thus, Z£> = {a G £^ ° F(a) = £^{a)}. The second equality, Zd = {a G «5^; 
^(a) =a}, is straightforward. 

We now prove by recursion on D that Id is finite, which is equivalent to 
proving that 

{aG I D \ a d ^0We{l,...,D}} 
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is empty or finite. If this set is nonempty, then any element a in this set 
satisfies 

W 6 {l,...,fl> E,( y ,, )=l, 
which implies that 

= V a max f E f grfC^",^" 7 ) 

Ef = i« igj (x,A-) y 



> E ,f log %^^)U . 

- v Ef =1 a i37 -(x,x') 

Since the global maximum of ^ is unique, we conclude that a = a max and 
hence (iii) follows. □ 

4.3. Averaged EM. Proposition 4.1 implies that our recursive procedure 
satisfies 8^{a t+l ) > £^{a t ). Therefore, the Kullback divergence criterion (1) 
decreases at each step. This property is closely linked to the EM algorithm 
([20], Section 5.3). More precisely, consider the mixture model 

V~M(1, {ai,..., a D }) and W = (X,X')\V ~ %{dx)Q v (x, dx 1 ) 

with parameter a. We denote by E a the corresponding expectation, by 
p a (v,w) the joint density of (V, W) with respect to (j, <8) £t an d by p a (w) 
the density of TV with respect to /x. It is then easy to check that £^{a) = 
J \og(p a (w))Tt (dw) , which is an average version of the criterion to be maxi- 
mized in the EM algorithm when only W is observed. In this natural 
idea, adapted from the EM algorithm, is to update a according to the iter- 
ative scheme 

q <+1 = argmax / F la t[\ogp a (V,w)\w]ir(dw). 

ae,y J 

Straightforward algebra can be used to show that this definition of a t+1 
is equivalent to the update formula a t+1 = F(a t ) that we used above. Our 
algorithm then appears as an averaged EM, but shares with EM the property 
that the criterion increases deterministically at each step. 

The following result ensures that any a different from a max is repulsive. 

Proposition 4.2. Under (Al) and (A2), for every a / a max £ y, there 
exists a neighborhood V a of a such that if a t( > £ V a , then (a*)t>t leaves V a 
within a finite time. 
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Proof. Let a 7^ a max . Then using the inequality u — 1 > logrt, we have 

V^maxTjj, ( Qd{X,X') \ 

d=i \12j=iOijqj(X,X')J 



J2j=iatjqj(X, X' 
which implies that there exists 1 < d < D such that 



E^q d {X,X') j/j2 a M X > X ')^J >L 



Using a nonincreasing sequence (W n ) n >o of neighborhoods of a in oS^, the 
monotone convergence theorem implies that 

1<J | = J h inf 



Ef=i (X,X')J~ * Ptw n J2f =1 p jQj (X, X>) 
< Hm inf 

Thus, there exist W no =V a , & neighborhood of a and 77 > 1 such that 



(2) v/3 g y Q e # ^(x,x') /y^m&x')^ > n. 

Now, for all f > and 1 < d < D, we have 1 > a l d > 0. Combining (2) with 
the update formulas for a 1 = F(a t ~ 1 ) shows that (a*)t>o leaves V a within a 
finite time. □ 

We thus conclude that the maximization algorithm is convergent, as as- 
serted by the following proposition: 

Proposition 4.3. Under (Al) and (A2), 

lim a 1 = a max . 

t— >oo 

Proof. First, recall that Id is a finite set and that a max Glj), that is, 
l D = {Po,/3i,...,p I } with /?o = a max . We introduce a sequence (Wi)o<i<i 
of disjoint neighborhoods of the Pi's such that for all < i < 7, -F(Wj) is 
disjoint from [jj^Wj [this is possible since F(fti) = Pi and F is continuous] 
and for all i G {1, ... ,7} , Wj C Vg i5 where the (VaJ's are defined as in the 
proof of Proposition 4.2. 

The sequence (a*))f>o is upper-bounded and nondecreasing; therefore 
it converges. This implies that lim^oo 8^ o F(a t ) — £^{a t ) = 0. By continuity 
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of o F — there exists T > such that for all t>T, at £ Uj Wj. Since 
F{Wi) is disjoint from Uj^i Wj, this implies that there exists i £ {0, ...,/} 
such that for all £ > T, a* £ Wj. By Proposition 4.2, i cannot be in {1, . . . , /}. 
Thus, for all t > T, a 1 £ W , which is a neighborhood of (3q = a max . □ 

5. The Rao Blackwellized D-kernel PMC. Update of the weights a& 
through the transform F thus improves the Kullback divergence criterion. 
We now discuss how to implement this mechanism within a PMC algorithm 
that resembles the previous D-kernel algorithm. The only difference with 
the algorithm of Section 3.1 is that we make use of the kernel structure in 
the computation of the importance weight. In MCMC terminology, this is 
called "Rao-Blackwellization" ([20], Section 4.2) and it is known to provide 
variance reduction in data augmentation settings ([20], Section 9.2). In the 
current context, the improvement brought about by Rao-Blackwellization 
is dramatic, in that the modified algorithm does converge to the proposal 
mixture that is closest to the target distribution in the sense of the Kullback 
divergence. More precisely, a Monte Carlo version of the update via F can 
be implemented in the iterative definition of the mixture weights, in the 
same way that MCEM approximates EM ([20], Section 5.3.3). 

5.1. The algorithm. In importance sampling, as well as in MCMC set- 
tings, the (de) conditioning improvement brought about by Rao-Blackwellization 
may be significant [5]. In the case of the D-kernel PMC scheme, the Rao- 
Blackwellization argument is that it is not necessary to condition on the 
value of the mixture component in the computation of the importance weight 
and that the improvement is brought about by using the whole mixture. The 
importance weight should therefore be 

, D 

K(xi,t) /Yl a d Qd{5;i,t-l,Xi,t) rather than it(x^ t ) /q Klit {x i)t -i, x ijt ), 
' d=l 

as in the algorithm of Section 3.1. As already noted by Hesterberg [15], 
the use of the whole mixture in the importance weight is a robust tool 
that prevents infinite variance importance sampling estimators. In the next 
section, we show that this choice of weight guarantees that the following 
modification of the D-kernel algorithm converges to the optimal mixture (in 
terms of Kullback divergence). 

Rao-Blackwellized D-kernel PMC algorithm. At time 0, use 
the same steps as in the D-kernel PMC algorithm to obtain (xj j o)i<i<Af and 
set afe N = l/D (l<d<D). 

At time t = l,...,T, 

(a) generate (JQ,t)i<»<jv ~ M(N, (afe )i< d < D ); 
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(b) generate independent x^t ~ qKi t (%i,t-i,%) (1 < * < -^0 an d compute the 

normalized importance weights u^t oc ir(xij)/J2d=i at d, N Qd(xi,t-i, x^t); 

(c) resample (xi,t) into (xj ) t)i<j<jv by multinomial sampling with weights 
(^i,t)l<i<Jv); 

(d) set aj 1 " 1 '* = Eili ^,A(#i,t) (1 < d < D). 

5.2. T/ie corresponding law of large numbers. Not very surprisingly, the 
sample obtained at each iteration of the above Rao-Blackwellized algorithm 
approximates the target distribution in the sense of the weak law of large 
numbers (LLN). Note that the convergence holds under the very weak as- 
sumption (Al) and for any test function h that is absolutely integrable with 
respect to the target distribution tt. The function h may thus be unbounded. 

Theorem 5.1. Under (Al), for any function h in L\ and for all t > 0, 

N ^ N 

^2u>i,th(x it t) -^ir(h) and — ^ h(x i)t ) -^p n(h). 

i=l i=l 

Proof. First, convergence of the second average follows from the con- 
vergence of the first average by Theorem A.l, since the latter is simply 
a multinomial sampling perturbation of the former. We thus focus on the 
weighted average and proceed by induction on t. The case t = is the basic 
importance sampling convergence result. For t > 1, if the convergence holds 
for t — 1, then to prove convergence at iteration t, we need only check, as in 
Proposition 3.1, that 

1 N N oo 

— ^2uj i: th(x i: t) -^wir(h), 

i=l 

where u^t denotes the importance weight Tr(xi i t)/J2d=i a d N ^(xi^-ijXi^)- 
(The special case h = 1 ensures that the renormalizing sum converges to 1.) 

We apply Theorem A.l with Q N = a{{x iit -i)i<i<N, (ofe )i<<kd) and U N>i = 
N~ 1 uii ) th{xij)- Then conditionally on Qn, the a^t's (1 < i < N) are indepen- 
dent and 

D 

Xi,t\GN ~ X! aJ N Qd{xi,t-i, •)■ 

d=l 



Noting that 



N 



N 



N 



E 



7r(x itt )h(x ir 



\ AT V— V D t,N (~ \ 

KN T,d=l a d Qd[Xi,t-l,Xi t t) 
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we only need to check condition (in). The end of the proof is then quite 
similar to the proof of Proposition 3.1. □ 

5.3. Convergence of the weights. The next proposition ensures that at 
each iteration, the update of the mixture weights in the Rao-Blackwellized 
algorithm approximates the theoretical update obtained in Section 4.2 for 
minimizing the Kullback divergence criterion. 

Proposition 5.1. Under (Al), for all t>l, 

(3) Vl<d<D ctf^ai 

where a 1 = F(a t ~ 1 ). 

Combining Proposition 5.1 with Proposition 4.3, we obtain that, under 
assumptions (A1)-(A2), the Rao-Blackwellized version of the PMC algo- 
rithm adapts the weights of the proposed mixture of kernels, in the sense 
that it converges to the optimal combination of mixtures with respect to the 
Kullback divergence criterion obtained in Section 4.2. 

PROOF of Proposition 5.1. The case t= 1 is obvious. Now, assume 
(3) holds for some t > 1. As in the proof of Proposition 3.1, we now establish 
that 

N -, N 



i=i 

the convergence of the renormalizing sum to 1 being a consequence of this 
convergence. We apply Theorem A.l with Q N = (r({xi,t-i)i<i<N , (ot d ,N )i<d<D) 
and U N)i = iV _1 a;» )t Id(iQ )t ). Conditionally on Q N , the (if i)t ,x i)t )'s (1 < i < 
N) are independent and for all (d,A) in {1, . . . , D} x A, we have 

F(Ki, t = d,x iit £ A\g N ) = JfQd{xi, t -\, A). 

To apply Theorem A.l, we need only check condition (iii). We have, for 
C>0, 



/ N 

a £ 

\i=l 



H^,tMKi, t )>c} 
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E* 

3=1 



irix) 



D 1 qj(x',x) 



by the LLN of Theorem 5.1. The right-hand side converges to as C tends to 
infinity since by assumption (Al), 7t({qj(x, x') = 0}) = 0. Thus, Theorem A.l 
applies and 



1 N / j N 

1=1 \ i=l 



0. 



To complete the proof, it simply remains to show that 

JV 



(4) 



/ 1 N 



1 



E 



t,N ,~ \ 



-ir(dx 



N-*po t+l 



N t=t J HiLia]'" qi{x itt -i,x) 
It follows from the LLN stated in Theorem 5.1 that 



1 



N 



(5) ^EH^w? 



i=l 



ot d qd(xi,t-i,x) n- 



Y,i=ia t l qi(x it t-i,x) 



a d q d {X,X') 
Y.?=ia\qi{X,X') 



a 



t+l 



and it thus suffices to check that the difference between (4) and (5) converges 
to in probability. To show this, first note that for all t > 1 and all 1 < 
d < D, a, > and thus, by the induction assumption, for all 1 < d < D, 



I t,N t \ l t N^oo 

K - a d)l a d 



P 0. Using the inequality 1 4 — -S | < 1 4 



A\\D-B\ , I A-C\\C | 



BW D 



we have, by straightforward algebra, that 

a d q d (x it t-i,x) 



C WD\i 



t,N ,~ \ 
a d ' q d (x it t-i,x) 



J2{Li a{ N qi{x i: t-i-,x) YaL\ oc\qi(xi,t-i,x 



< 



t,N ;~ \ 

u d qd(xi, t -i,x) 



T,?=i®i N qj(xi,t-i,x) v 



sup 



t,N 



a, 



+ 



a 



t,N 



a:, 



a d q d (x i: t-i,x) 



< 2 sup 
ie{i,...,£»} 



t,N 
a. 



ot) 



The proposition then follows from the convergence {pt d N — a d )/a 
□ 



0. 
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5.4. A corresponding central limit theorem. We now establish a CLT for 
the weighted and the unweighted samples when the sample size goes to infin- 
ity. As noted in Section 2.2 for the SIR algorithm, the asymptotic variance 
associated with the unweighted sample is larger than the variance of the 
weighted sample because of the additional multinomial step. 

Theorem 5.2. Under (Al), 

(i) for a function h such that Tt{h 2 (x')Tr(x) /qd(x,x')} < oo for at least 
one 1 < d < D, we have 

_ N 

(6) VNY, Ui,t{h{xi, t ) - 7r(h)} Af(0, of), 

i=l 

where of = Tt({h(x') - vr(/i)} 2 vr(x / )/EdLi a d q d (x, x')); 
if, moreover, ir{h 2 ) < oo, then 

1 N 

(7) Y,ih&,t) ~ Ah)} AA(0, a 2 + V w (h)). 

Note that amongst the conditions under which this theorem applies, the 
integrability condition 



(8) TffftVl^U 

V q d (x,x')J 



< oo 



is required for some d in {1, . . . , D} and not for all cfs. Thus, settings where 
some transition kernels %(•,■) do not satisfy (8) can still be covered by this 
theorem provided (8) holds for at least one particular kernel. An equivalent 
expression of the asymptotic variance a 2 is 



o~t = ({h — Tr(h)}—\ where v(dx, dx) = n(dx) ( ^ a^Qdix, dx') j . 

\ d=l / 



Written in this form, a 2 turns out to be the expression of the asymptotic 
variance that appears in the CLT associated with a self-normalized IS algo- 
rithm (SNIS) (see Section 2.1 for a description of the algorithm), where the 
proposal distribution would be v and the target distribution 7f. Obviously, 
this SNIS algorithm cannot be implemented since, given the above defini- 
tion of v, the proposal distribution depends on both ir and the weights (a d ), 
which are unknown. 



Proof of Theorem 5.2. Without loss of generality, we assume that 
n(h) = 0. Let do G {1, . . . } D} be such that jt{h 2 (x')ir{x) / qd {x,x')) < oo. 
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A consequence of the proof of Theorem 5.1 is that YaLi u i,t — — >f 1, so we 
only need to prove that 

1 N 

(9) j2^M^t)^m,^)- 

* i=l 

We will apply Theorem A. 2 with 

tt 1 ui \ 1 n(xi t t)h(x itt ) 

vA VJV^aj qd(xi,t-i,Xi,t) 

Qn = o"{(5t,t-i)i<i<jv, (a* ! ' JV ')i<£i<£»)}- 

Conditionally on (yjv> the (xi^)'s (1 < £ < A) are independent and 

D 

Xi,t\GN ~ X! a d N Qd{%i,t-l, •)■ 

d=l 

Conditions (i) and (ii) of Theorem A. 2 are clearly satisfied. To check condi- 
tion (iii), first note that E(£/jv,j|£w) = 7r(/t) = 0. Moreover, 

N 

A N =Y J nu 2 N , l \Q N ) 

8=1 



^E/ fr 2 (* U g J (x) r r*(*0- 



A . 



d=l&d 1d{Xi,t-l,X) 

By the LLN for (xi t t) stated in Theorem 5.1, we have 



7r(x) . . jV-»oo 2 



«d9d(5i,t-i,a;) 



ir(dx) 



To prove that (iii) holds, it is thus sufficient to show that \Bn — An\ I -^rp 0. 
Since a do > 0, we need only consider the upper bound 

s\ (»d-®d N \ l ^f h 2 (x)n(x) \ . N ^oo n 
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Thus, condition (iii) is satisfied. Finally, we consider condition (iv). Using 
the same argument as was used for condition (iii), we have that 



I 



N 

r t,i\r _i j , ) E 

1 d o d o' i=1 

N 



7r(x itt )h(xij) 



>C 



>N 



— > ¥ 7t\h {x)^zr c 



ir(x) 



ir(x)h(x) 



r- > C \ir(dx) 



7r(x)h(x) 



a do qd (x',x) I 2 1 a t do q do (x', x) 



>C 



which converges to as C tends to infinity. Thus, Theorem A. 2 can be 
applied and the proof of (6) is completed. The proof of (7) follows from a 
direct application of Theorem A. 2, as in the SIR result, by setting {7jv,i = 
-±=h(x it t) and Q N = a{(x ijt )i<i<N, (^i,t)i<i<iv}- □ 



6. Illustrations. In this section, we briefly show how the iterations of 
the PMC algorithm quickly implement adaptivity toward the most efficient 
mixture of kernels, through three examples of moderate difficulty. (The R 
programs are available from the last author's website via an Snw file.) 



Example 1. As a first toy example, we take the target tt to be the 
density of a five-dimensional normal mixture, 

(10) ]T^ 5 (0,£ 4 ), 

1=1 

and an independent normal mixture proposal with the same means and 

2 ./V 

variances as (10), but started with different weights a d ' . Note that this is 
a very special case of a D-kernel PMC scheme in that the transition kernels of 
Section 5.1 are then independent proposals. In this case, the optimal choice 
of weights is obviously a* d = 1/3. In our experiment, we used three Wishart- 
simulated variances Xj, with 10, 15 and 7 degrees of freedom, respectively. 
The starting values a d ' N are indicated on the left of Figure 1, which clearly 
shows the convergence to the optimal values 1/3 and 2/3 for the two first 
accumulated weights in less than ten iterations. (Generating more simulated 
points at each iteration stabilizes the convergence graph, but the primary 
aim of this example is to exhibit the fast convergence to the true optimal 
values of the weights.) 
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Example 2. As a second toy example, consider the case of a three- 
dimensional normal ^4^(0, E) target with covariance matrix 

/6.986 0.154 3.523 \ 
E = 0.154 15.433 3.528 
\3.523 3.528 18.463/ 

and the mixture of three kernels given by 

(11) a*' ^(x i)t _i,Ei) + JV (x i)t -\^2) «^(£i,t-i,E 3 ), 

where the Ej's are random Wishart matrices. [The first proposal in the 
mixture is a product of unidimensional Student i-distributions with two 
degrees of freedom, centered at the current values of the components of 
Xij-i and rotated by ^[r\ diag(E 1//2 ).] 

A compelling feature of this example is that we can visualize the Kullback 
divergence on the M 3 simplex in the sense that the divergence 




10 20 30 40 50 

Iterations 

Fig. 1. Convergence of the accumulated weights a{ N and a*' +c4' for the three-com- 
ponent normal mixture to the optimal values 1/3 and 2/3 (represented by dotted lines). At 
each iteration, N — 1,000 points were simulated from the D-kernel proposal. 
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o 




0.0 0.2 0.4 0.6 0.8 1.0 



Fig. 2. Grey level and contour representation of the Kullback divergence t(ct\,a.2,ctz) 
between the j9z(Q, E) distribution and the three- component mixture proposal (11). {The 
darker pixels correspond to lower values of the divergence.) We also represent [in white) 
the path of one run of the D-kernel PMC algorithm when started from a random value 
(ai, 02,0:3). The number of iterations T is equal to 500, while the sample size N is 50,000. 

can be approximated by a Monte Carlo experiment on a grid of values of 
(«i , «2 ) • Figure 2 shows the result of this Monte Carlo experiment based on 
25,000 ./^(O, S) simulations and exhibits a minimum divergence inside the 
M 3 simplex. Running the Rao-Blackwellized D-kernel PMC algorithm from 
a random starting weight (01,02,03) always leads to a neighborhood of the 
minimum, even though a strict decrease in the divergence requires a large 
value for N and a precise convergence to a max necessitates a large number 
of iterations T. 



Example 3. Our third example is a contingency table inspired by [1], 
given here as Table 1. We model this dataset by a Poisson regression, 

x ij ~ ^ 2> (exp(a i + £,.7=0,1, 

with oq = for identifiability reasons. We use a flat prior on the param- 
eter 9 = (ai, /3q, (3i) and run the PMC D-kernel algorithm with a mix- 
ture of ten normal random walk proposals, ^V(6^t-x, Qd.1 '{&))■, d = 1, . . . , 10, 
where 1(9) is the Fisher information matrix evaluated at the MLE, 9 = 
(—0.43,4.06,5.9), and where the scales vary from 1.35e— 19 to 1.54e+07 
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(the Qd's are equidistributed on a logarithmic scale). The results of five (suc- 
cessive) iterations of the Rao-Blackwell D-kernel algorithm are as follows: 
unsurprisingly, the largest variance kernels are hardly ever sampled, but 
fulfill 

their main role of variance stabilizers in the importance sampling weights 
while the mixture concentrates on the medium variances, with a quick con- 
vergence of the mixture weights to the limiting weights — the accumulated 
weights of the 5th, 6th, 7th and 8th components of the mixture converge to 
0, 0.003, 0.259 and 0.738, respectively. The fit of the simulated sample to 
the target distribution is shown in Figure 3, since the points of the sample 
do coincide with the (unique) modal region of the posterior distribution. 
This experiment also shows that there is no degeneracy in the samples pro- 
duced: most points in the last sample have very similar posterior values. For 
instance, 20% of the sample corresponds to 95% of the weights, while 1% of 
the sample corresponds to 31% of the weights. A closer look at convergence 
is provided by Figure 4, where the histograms of the resampled samples are 
represented, along with the distribution of the log likelihood and the empiri- 
cal cumulative distribution function cdf of the importance weights. They do 
not signal any degeneracy phenomenon, but, rather the opposite — a clear 
stabilization around the values of interest. 



7. Conclusion and perspectives. This paper shows that it is possible 
to build an adaptive mixture of proposals aimed at a minimization of the 
Kullback divergence with the distribution of interest. We can therefore set 
different goals for a simulation experiment and expect to arrive at the most 
accurate proposal. For instance, in a companion paper [9], we also derive an 
adaptive update of the weights targeted at the minimal variance proposal 
for a given integral 3 of interest. Rather naturally, these results are achieved 
under strong restrictions on the family of proposals in the sense that the 
parameterization of those families is restricted to the weights of the mixture. 
It is, however, possible to extend the above results to general mixtures of 
parameterized proposals, as shown by work currently under development. 



Table 1 
Two-by-two contingency table 
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Total 





60 


364 


424 
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36 


240 


276 


Total 


96 


604 


700 
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s 




> 



Pi likelihood & weights Pi likelihood & weights 




Fig. 4. Evolution of the samples over four iterations of the Rao-Blackwellized D-kernel PMC sampler for the contingency table example 
(the output from each iteration is a block of four graphs, to be read from left to right and from top to bottom): histograms of the resampled 
samples of ati, /3o and f3i of size 50,000 and (lower right of each block) log-likelihood and the empirical cumulative distribution function 
of the importance weights. 
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A more practical direction of research is the implementation of such adap- 
tive algorithms in large-dimensional problems. While our algorithms are 
in fine importance sampling algorithms, it is conceivable that mixtures of 
Gibbs-like proposals can take better advantage of the intuition gained from 
MCMC methodology, while keeping the finite-horizon validation of impor- 
tance sampling methods. The major difficulty in this direction, however, is 
that the curse of dimensionality still holds in the sense that (a) we need to 
simultaneously consider more and more proposals as the dimension increases 
(as, e.g., the set of all full conditionals) and (b) the number of parameters 
to tune in the proposals exponentially increases with the dimension. 

APPENDIX A: CONVERGENCE THEOREMS FOR TRIANGULAR 

ARRAYS 

In this section, we recall convergence results for triangular arrays of ran- 
dom variables (see [4] or [10] for more details, including the proofs). We will 
use these results to study the asymptotic behavior of the PMC algorithm. 
In what follows, let {UN,i} N>i,i<i<N be a triangular array of random vari- 
ables defined on the same measurable space (Q,A) and let {Gn}n>i be a 
sequence of cr-algebras included in A. The symbol Xn — >p a means that 
Xn converges in probability to a as N goes to infinity. 

Definition A.l. The sequence {UN,i}N>i,i<i<N is independent given 
{Gn}n>i if for all N > 1, the random variables Un,i, ■ ■ ■ , Un,n are indepen- 
dent given Qn. 

Definition A. 2. The sequence of variables {Zn}n>i is bounded in 
probability if 

lim supP[|Zjv| > C] = 0. 

C^ootv>1 

Theorem A.l. // 

(i) {UN,i}N>i,i<i<N is independent given {Gn}n>i~, 

(ii) the sequence {J2iLi^[\UN,i\\GN]}N>i is bounded in probability; 

(iii) for allnX), Eli E[\U N 'i\ I ]UnM>t1 \Gn] — p 0, 

then J2f=i(U N ,i ~ E[U N ,i\g N ]) 0. 

Theorem A. 2. // 

(i) {Un,i} N>i,i<i<N is independent given {£//v}iV>i; 

(ii) for allN> l,Vi€ {l,...,iV}, E[\U N)i \ \G N ] < oo; 
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a 



2. 



(iv) for all r, > 0, Eli Ep^u^g. 



0, 



i/ten /or a/l it £ M, 



E 



pi tit 2_^(Un,i -E[u N ,i\g N ]) 



> P exp 



u 2 a 2 



APPENDIX B: PROOF OF PROPOSITION 3.1 



We proceed by induction with respect to t. Using Theorem A.l, the case 
t = is straightforward as a direct consequence of the convergence of the 
importance sampling algorithm. 

For t > 1, let us assume that the LLN holds for t — 1. Then to prove that 
J2iLi&i,th(xij,Kij) converges in probability to -K®^ u (h), we need only 
check that 



v 



1 «K M (£i ) t-lj»i,t) 



h(x i: t,K itt ) -=htp 7T <g) 7«(/t), 



the special case h = 1 providing the convergence of the normalizing constant 
for the importance weights. Applying Theorem A.l with 



U N4 =N' 1 



n(x it ) 



qK it (xi,t-i,Xi,t) 



h(xij,K itt ) 



and 



■ (r{(xi,t-i)i<i<N, ( a d N ) 



Kd<D 



}• 



where a{(Xi)i} denotes the cr-algebra induced by the Xi's, we need only 
check condition (iii). For any C > 0, we have 



v 



Y 1 ^ E 



i=i 



(12) 



qKi, t (xi,t-i,Xi,t 



qK it (xi,t-i,Xi,t) 



h(xi,t,K it t) > C 



Qn 



D 



N 



t,N 
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where Fc(x,k) = J Tr(du)h(u, k)I{ q ^^ u ) h(u, k) > C}. By induction, we have 



t,N 



N 



a. 



i=l 



N 



N- 1 J2 Fc{Zi,t-i,k) ^ n(F c (; k)). 

i=l 



Using these limits in (12) yields 



N 

i=l 
N—>ac 



Tr(xi,t)h(xi t t,Ki t t)^ Tr(x it t)h(xi t t, K^ t ) c 



qK iit (xi,t-i,Xi,t) I QK t , t (xi,t-i,Xi,t) 



Since 7r (8) "y u (Fc) converges to as C goes to infinity, this proves that for 
any rj > 0, 



N 



i=l 



ir(x iit )h(xi t t, K i)t ) f -K(x it t)h(x it t, K itt ) >N 



qK it (xi,t-i,Xi,t) I qK it (xi,t-i,Xi,t 



Qn 



0. 



Condition (hi) is satished and Theorem A.l applies. The proof follows. 

Acknowledgments. The authors are grateful to Olivier Cappe, Paul Fearn- 
head and Eric Moulines for helpful comments and discussions. Comments 
from two referees helped considerably in improving the focus and presenta- 
tion of our results. 



REFERENCES 

[1] Agresti, A. (2002). Categorical Data Analysis, 2nd ed. Wiley, New York. 
MR1914507 

[2] Andrieu, C. and Robert, C. (2001). Controlled Markov chain Monte Carlo methods 
for optimal sampling. Technical Report 0125, Univ. Paris Dauphine. 

[3] Cappe, O., Guillin, A., Marin, J. and Robert, C. (2004). Population Monte 
Carlo. J. Comput. Graph. Statist. 13 907-929. MR2109057 

[4] Cappe, O., Moulines, E. and Ryden, T. (2005). Inference in Hidden Markov Mod- 
els. Springer, New York. MR2159833 

[5] Celeux, C, Marin, J. and Robert, C. (2006). Iterated importance sampling in 
missing data problems. Comput. Statist. Data Anal. 50 3386-3404. 

[6] Chopin, N. (2004). Central limit theorem for sequential Monte Carlo methods and 
its application to Bayesian inference. Ann. Statist. 32 2385-2411. MR2153989 

[7] CsiSZAR, I. and Tusnady, G. (1984). Information geometry and alternating min- 
imization procedures. Recent results in estimation theory and related topics. 
Statist. Decisions 1984 (suppl. 1) 205-237. MR0785210 

[8] Del Moral, P., Doucet, A. and Jasra, A. (2006). Sequential Monte Carlo sam- 
plers. J. R. Stat. Soc. Ser. B Stat. Methodol. 68 411-436. MR2278333 



ADAPTIVE MIXTURES FOR IMPORTANCE SAMPLING 



29 



[9] Douc, R., Guillin, A., Marin, J. and Robert, C. (2005). Minimum variance 
importance sampling via population Monte Carlo. Technical report, Cahiers du 
CEREMADE, Univ. Paris Dauphine. 
[10] Douc, R. and Moulines, E. (2005). Limit theorems for properly weighted samples 
with applications to sequential Monte Carlo. Technical report, TSI, Telecom 
Paris. 

[11] Doucet, A., de Freitas, N. and Gordon, N., eds. (2001). Sequential Monte Carlo 
Methods in Practice. Springer, New York. MR1847783 

[12] Gilks, W., Roberts, G. and Sahu, S. (1998). Adaptive Markov chain Monte Carlo 
through regeneration. J. Amer. Statist. Assoc. 93 1045-1054. MR1649199 

[13] Haario, H., Saksman, E. and Tamminen, J. (1999). Adaptive proposal distribution 
for random walk Metropolis algorithm. Comput. Statist. 14 375-395. 

[14] Haario, H., Saksman, E. and Tamminen, J. (2001). An adaptive Metropolis algo- 
rithm. Bernoulli 7 223-242. MR1828504 

[15] Hesterberg, T. (1995). Weighted average importance sampling and defensive mix- 
ture distributions. Technometrics 37 185-194. 

[16] Iba, Y. (2000). Population-based Monte Carlo algorithms. Trans. Japanese Society 
for Artificial Intelligence 16 279-286. 

[17] KijNSCH, H. (2005). Recursive Monte Carlo filters: Algorithms and theoretical anal- 
ysis. Ann. Statist. 33 1983-2021. MR2211077 

[18] Mengersen, K. L. and Tweedie, R. L. (1996). Rates of convergence of the Hastings 
and Metropolis algorithms. Ann. Statist. 24 101-121. MR1389882 

[19] Robert, C. (1996). Intrinsic losses. Theory and Decision 40 191-214. MR1385186 

[20] Robert, C. and Casella, G. (2004). Monte Carlo Statistical Methods, 2nd ed. 
Springer, New York. MR2080278 

[21] Roberts, G. O., Gelman, A. and Gilks, W. R. (1997). Weak convergence and 
optimal scaling of random walk Metropolis algorithms. Ann. Appl. Probab. 7 
110-120. MR1428751 

[22] Rubin, D. (1988). Using the SIR algorithm to simulate posterior distributions. In 
Bayesian Statistics 3 (J. M. Bernardo, M. H. DeGroot, D. V. Lindley and 
A. F. M. Smith, eds.) 395-402. Oxford Univ. Press. 

[23] Sahu, S. and Zhigljavsky, A. (1998). Adaptation for self regenerative MCMC. 
Technical report, Univ. of Wales, Cardiff. 

[24] Sahu, S. and Zhigljavsky, A. (2003). Self regenerative Markov chain Monte Carlo 
with adaptation. Bernoulli 9 395-422. MR1997490 

[25] Tierney, L. (1994). Markov chains for exploring posterior distributions (with dis- 
cussion). Ann. Statist. 22 1701-1762. MR1329166 



R. Douc 

cmap, ecole polytechnique, cnrs 
Route de Saclay 
91128 Palaiseau cedex 
France 

E-MAIL: douc@cmapx.polytcchniquc.fr 



A. Guillin 

Ecole Centrale de Marseille et LATP, CNRS 

Centre de Mathematiques et Informatique 

Technopole Chateau-Gombert 

39 rue F. Joliot Curie 

13453 Marseille cedex 13 

France 

E-MAIL: guillin@cmi.univ-mrs.fr 



30 



R. DOUC, A. GUILLIN, J.-M. MARIN AND C. P. ROBERT 



J.-M. Marin 

INRIA Futurs, Projet Select 
Laboratoire de Mathematiques 
Universite d'Orsay 
91405 Orsay cedex 
France 

E-MAIL: Jean-Michcl.Marin@inria.fr 



CP. Robert 

Ceremade, Universite Paris Dauphine 

75775 Paris cedex 16 

France 

E-MAIL: xian@ceremade.dauphinc.fr 



